SIGIR 2007
Authors
Abstract
Author identification can be seen as a single-label, multi-class text categorization problem. Very often, there are extremely few training texts for at least some of the candidate authors, or the text length varies significantly among the available training texts of the candidate authors. Moreover, in this task the distribution of training texts over the classes usually bears no similarity to the distribution of test texts, so a basic assumption of inductive learning does not apply. Previous work [3] provided solutions to this problem for instance-based author identification approaches (i.e., each training text is considered a separate training instance). This work [4] deals with the class imbalance problem in profile-based author identification approaches (i.e., a single profile is extracted from all the training texts of each author). In particular, it presents a variation of the Common N-Grams (CNG) method, a language-independent profile-based approach [2] with good results in many author identification experiments so far [1]. The variation is based on new distance measures that are quite stable for large profile length values. Special emphasis is given to the degree to which the effectiveness of the method is affected by the number of training text samples available per author. Experiments on same-topic text samples from the Reuters Corpus Volume 1 are presented using both balanced and imbalanced training corpora. The results show that CNG with the proposed distance measures is more accurate when only limited training text samples are available for at least some of the candidate authors, a realistic condition in author identification problems.
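To make the profile-based idea concrete, the following is a minimal sketch of a CNG-style classifier. It assumes the commonly cited original CNG formulation (a profile is the most frequent character n-grams with normalized frequencies, compared by a relative-difference dissimilarity); it does not implement the new distance measures mentioned in the abstract, which are not defined here. The function names `build_profile`, `cng_dissimilarity`, and `attribute`, as well as the default values of `n` and `profile_len`, are hypothetical choices for illustration.

```python
from collections import Counter


def build_profile(text, n=3, profile_len=1000):
    """Map the profile_len most frequent character n-grams to normalized frequencies."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return {}
    counts = Counter(grams)
    total = len(grams)
    return {g: c / total for g, c in counts.most_common(profile_len)}


def cng_dissimilarity(p1, p2):
    """Assumed original CNG dissimilarity: sum over the union of both profiles
    of (2 * (f1 - f2) / (f1 + f2)) ** 2, with missing n-grams counted as 0."""
    total = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile
        total += (2.0 * (f1 - f2) / (f1 + f2)) ** 2
    return total


def attribute(test_text, training_texts_by_author, n=3, profile_len=1000):
    """Return the candidate author whose profile is least dissimilar to the test text."""
    test_profile = build_profile(test_text, n, profile_len)
    scores = {}
    for author, texts in training_texts_by_author.items():
        # Profile-based: all training texts of an author are concatenated
        author_profile = build_profile("\n".join(texts), n, profile_len)
        scores[author] = cng_dissimilarity(test_profile, author_profile)
    return min(scores, key=scores.get)
```

Because every n-gram missing from one profile contributes the maximum term of 4, an author with very little training text yields a short profile and many such terms when `profile_len` is large. This is one plausible reading of why the abstract stresses distance measures that remain stable for large profile lengths under class imbalance.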
Similar resources
SIGIR WORKSHOP REPORT Report on the SIGIR 2007 Workshop on Focused Retrieval
On 27 July 2007, the SIGIR 2007 Workshop on Focused Retrieval was held as part of SIGIR in Amsterdam, the Netherlands. Nine papers were presented in three sessions, and a fourth session featured a panel discussion. This report outlines the events of the workshop and summarizes the major outcomes.
SIGIR Announces Member Plus Program
SIGIR is offering a new membership class beginning July 1, 2000. The new "Member Plus" membership package includes conference proceedings from three conferences sponsored by ACM SIGIR: The Member Plus package also includes the benefits of "Basic" membership, such as SIGIR Forum and reduced registration fees at the SIGIR conference. Member Plus membership is an inexpensive method of staying abre...
Node Behavior Prediction for Large-Scale Approximate Information Filtering
The workshop is co-located with the 30th Annual International ACM SIGIR Conference 2007, July 23-27, 2007, Amsterdam, Netherlands (Hotel Krasnapolsky & University of Amsterdam). LSDS-IR 2007 poster by Christian Zimmer. Authors: Christian Zimmer (1), Christos Tryfonopoulos (1), Klaus Berberich (1), Gerhard Weikum (1), and Manolis Koubarakis (2). Affiliations: (1) Max-Planck Institute for Informatics, Saarbrücken, Germany; (2) National and Ka...
Future Directions
1. Anick P. Using terminological feedback for Web search refinement – a log-based study. In Proc. 26th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2003, pp. 88–95. 2. Brill E. and Moore R.C. An improved error model for noisy channel spelling correction. In Proc. 38th Annual Meeting of the Assoc. for Computational Linguistics, 2000, pp. 86–293. 3. Croft W.B....
Editorial for the 2nd Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) at SIGIR 2017
SIGIR WORKSHOP REPORT Workshop on Desktop Search
The first SIGIR workshop on Desktop Search was held on 23 July 2010 in Geneva, Switzerland. The workshop consisted of 2 industrial keynotes, 10 paper presentations in a combination of oral and poster format and several discussion sessions. This report presents an overview of the scope and contents of the workshop and outlines the major outcomes.